1. Introduction


The  standard  for  acquiring  knowledge  in  institutional  training  within  the  US  Army  is  split 
between  traditional  classroom  training  and  live  training.  These  methods  are  used  to  test  recall 
and  allow  Soldiers  to  apply  and  test  their  skills  respectively  in  varying  conditions  and  against  a 
set  of  standards.  Over  the  last  30-40  years,  virtual  simulation  has  been  added  to  the  training 
toolbox  and  a  debate  has  raged  about  what  mix  of  live  and  virtual  training  is  optimal.  To 
augment  institutional  training,  and  provide  flexibility  and  accessibility  for  Soldiers  who  need 
training,  the  Army  has  recently  emphasized  self-regulated  learning;  Soldiers  are  largely 
responsible  for  managing  their  own  learning.  From  a  common  sense  point  of  view,  it  may  not 
seem  practical  for  each  Soldier  to  be  able  to  manage  his/her  learning  without  some  guidance. 

This guidance, also referred to as coaching, mentoring, or tutoring, is usually provided
one-to-one by a human tutor. Generally, this function has fallen upon noncommissioned officers.
However, although the success of one-to-one tutoring has been recognized by Bloom (1984, 2σ
effect size) and VanLehn (2011, 0.8σ effect size), it is impractical to implement in large
organizations like the Army.

Once  we  decide  to  pull  the  human  tutor  out  of  the  instructional  loop,  our  alternative  is  to  provide 
one-to-one  computer-guided  instruction  using  intelligent  tutoring  systems  (ITSs),  which  have 
been  shown  to  be  effective  in  promoting  individual  learning  in  static  (e.g.,  desktop),  simple, 
well-defined  (procedural)  domains  (e.g.,  mathematics,  physics).  Well-defined  domains  generally 
have  one  solution  to  a  problem  presented  whereas  ill-defined  domains  may  have  multiple  paths 
to  success.  ITSs  are  a  practical  alternative  to  one-to-one  tutoring  but  are  costly  to  author 
(develop)  and  do  not  have  sufficient  adaptability  to  support  more  dynamic,  complex,  ill-defined 
domains  represented  in  many  Army  operations.  To  address  the  needs  of  learners,  authors,  and 
analysts/researchers  who  use  or  might  use  adaptive  tutoring  technologies  to  learn,  develop  new 
ITSs,  and  analyze  the  effect  of  ITS  technologies,  the  US  Army  Research  Laboratory  (ARL) 
created  the  Generalized  Intelligent  Framework  for  Tutoring  (GIFT)  (Sottilare  et  al.  2012). 

GIFT  is  a  prototype  open-source,  service-oriented,  adaptive  tutoring  architecture  targeted  to 
support  automated  authoring,  automated  one-to-one  and  one-to-many  guided  instructional 
experiences,  and  evaluation  of  effect  to  determine  the  impact  of  current  and  emerging  tutoring 
technologies with regard to learning outcomes. Ultimately GIFT will be a community
development project. Currently GIFT has about 400 registered users in 30 countries and is
freely available at www.GIFTtutoring.org.

This report is one of 3 evaluating the usability of GIFT from 3 perspectives: learners, authors,
and researchers/analysts. This report is focused on the researcher’s/analyst’s perspective, which
is about what people who use GIFT to evaluate adaptive tutoring technologies (e.g., tools and
methods) think about their experience and GIFT’s ease of use in facilitating and managing their
experiment/evaluation planning, execution, and data analysis.

The  evaluation  construct  of  GIFT  is  intended  to  provide  tools  and  methods  in  a  testbed 
environment  to  make  it  possible  to  easily  evaluate  the  effect  (e.g.,  learning  effect,  performance 
effect)  of  various  ITS  technologies  and  methods.  GIFT  currently  supports  experimental  design, 
data  collection,  and  evaluation  as  a  testbed  function  to  compare/contrast  the  effectiveness  of 
adaptive  tutoring  technologies. 

This  report  outlines  researcher/analyst  evaluations  conducted  by  cadets  within  the  Engineering 
Psychology  Program,  part  of  the  Behavioral  Science  and  Leadership  Department  at  the  US 
Military  Academy  (USMA)  as  part  of  their  coursework  in  “Human  Factors  of  Military  Training 
Simulations”  (PL488E)  during  the  2014  Spring  semester. 


2.  Evaluation  of  GIFT  from  a  Researcher’s  or  Analyst’s  Perspective 


2.1  Introduction 

Self-efficacy scores and mood differences were measured as an outcome of taking the GIFT
logic puzzle. They were useful for tracking changes from initial baseline ratings. Because GIFT
is a new concept, it was tested to observe self-efficacy and mood at several points in time,
with the aim of informing future testing procedures intended to raise both scores. Although the
self-efficacy and mood scores seemed promising, they showed no statistically significant change
except in one comparison. They were relatively volatile and hard to analyze because of the
rather vague interpretation of mood and self-efficacy coupled with the small sample of 3
participants (users).

2.2  Research  and  Analysis  Using  GIFT 

In  a  growing  digital  age,  researchers  are  constantly  looking  for  new  ways  to  learn  tasks  in  an 
effective  and  low-cost  manner.  In  the  Army,  where  a  diminishing  budget  is  a  real  threat  to 
efficient  training,  researchers  are  increasingly  searching  for  innovative  ways  to  minimize  cost 
while maximizing Soldier preparation for war. GIFT, an ITS, shows promise as a potential
solution  for  effective  learning  outside  of  the  traditional  Army  classrooms  and  encompasses  4 
major  components  that  define  the  way  in  which  it  actively  teaches  and  reiterates  knowledge  to  a 
new  user  (Pavlik  et  al.  2013). 

The  first  component  is  known  as  the  domain  model,  as  in  a  distinguishable  set  of  skills  and 
knowledge  and  “is  a  representation  of  all  the  possible  student  states  in  the  domain”  (Pavlik 
et  al.  2013,  p.  39).  The  second  component  is  the  student  model.  It  is  distinguished  as  the  subset 
of  the  domain  model  and  one  that  can  change  throughout  the  course  of  learning.  In  other  words, 
human states throughout the testing can be inferred and interpreted through provided performance
data. The third component is effectively known as the pedagogical model, which is interesting in
the  sense  that  it  is  not  a  static  model;  in  fact,  it  is  a  model  meant  to  be  fluid  and  changing  based 
on  the  needs  of  the  user  at  any  specific  point  in  time  of  the  intelligent  tutoring  session.  The 
fourth  and  final  component  is  known  as  the  tutor-student  interface  model  and  is  unique  because 
of  its  changing  media  output  that  is  based  on  the  media  input  of  the  user.  GIFT  uses  and  employs 
all  4  models  with  the  addition  of  an  optional  sensor  module  in  an  attempt  to  optimize  learning 
potential  (Pavlik  et  al.  2013).  In  the  case  of  the  logic  puzzle,  GIFT  used  a  variety  of  the 
components  to  help  the  participant  actively  engage  and  solve  the  problem. 
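
To make the division of labor among the 4 components more concrete, the following is a minimal
sketch in Python. It is not GIFT's actual API: every class, function, and clue name here is
hypothetical and chosen only for illustration. It assumes the learner state can be treated as a
changing subset of the domain's concepts, as described above, and uses a deliberately simple
policy in place of a real pedagogical model.

```python
from dataclasses import dataclass, field


@dataclass
class DomainModel:
    """All concepts a learner could master (hypothetical example names)."""
    concepts: frozenset


@dataclass
class LearnerModel:
    """The learner's current state: a changing subset of the domain."""
    mastered: set = field(default_factory=set)

    def update(self, concept: str, correct: bool) -> None:
        # Infer state from performance data: add on success, drop on failure.
        if correct:
            self.mastered.add(concept)
        else:
            self.mastered.discard(concept)


def choose_next_concept(domain: DomainModel, learner: LearnerModel):
    """Toy pedagogical policy: pick the first unmastered concept, if any."""
    remaining = sorted(domain.concepts - learner.mastered)
    return remaining[0] if remaining else None


def render_prompt(concept) -> str:
    """Toy interface layer: turn the tutor's decision into learner-facing text."""
    if concept is None:
        return "All concepts mastered."
    return f"Next, practice: {concept}"


# Hypothetical clue types for a logic puzzle; not taken from the GIFT course.
domain = DomainModel(frozenset({"direct clue", "negative clue", "ordering clue"}))
learner = LearnerModel()
learner.update("direct clue", correct=True)
print(render_prompt(choose_next_concept(domain, learner)))
```

The point of the sketch is only that the pedagogical decision depends on the current learner
state rather than on a fixed script, which is what makes the instruction adaptive.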

Moreover,  understanding  GIFT  and  the  implications  of  human-less  tutoring  systems  may 
actually  give  insight  into  the  way  in  which  humans  perceive  the  concept  of  artificial  intelligence 
in  the  future  (Sottilare  2013).  By  using  GIFT  as  a  stepping  stone,  the  future  of  teaching  has  the 
potential  to  be  completely  overhauled.  Statistics  derived  from  mood  ratings  and  self-efficacy 
scores  give  insight  into  the  user’s  experience  and  create  future  perceptions.  In  summary,  as  stated 
by  Sottilare,  “theoretical  concepts  of  today  will  evolve  into  the  practical  implementations  of 
tomorrow”  (Sottilare  2013,  p.  195).  Furthermore,  when  developing  ITSs,  analyses  must  be 
performed  on  empirical  data  to  develop  a  variety  of  key  functions  to  include  measures  of  success 
and  adaptive  instruction  and  support  (Sottilare  2013). 

2.3 Methods

The  following  sections  describe  the  participants,  apparatus,  procedure,  and  results  of  the 
evaluation  of  GIFT  as  a  research  and  analysis  tool. 

2.3.1  Participants 

The  evaluators  of  GIFT  authoring  tool  usability  are  “firsties”  (senior-level  cadets)  at  USMA. 

2.3.2  Apparatus 

The  apparatus  used  in  this  evaluation  was  the  intelligent  tutoring  system  architecture,  GIFT,  and 
specifically  the  Event  Reporting  Tool  (ERT),  to  collect  user  data. 

2.3.3  Procedure 

First, 3 participants (“users”, 2 male and 1 female) between the ages of 21 and 29 (mean [M] =
24, standard deviation [SD] = 4.36) ran through the GIFT Logic Puzzle Tutorial on an Alienware
M17x laptop. Each participant created an individual profile and completed the tutorial in a single
session, each session taking about 1 hr. The sessions consisted of pre-surveys, the tutorial,
mid-surveys, assessment questions, a logic puzzle, and post-surveys. The tutorial and logic puzzle
consisted of matching various food items to their purchasers based on different types of clues
given in the game. A practice test was given between the 2 scenarios focusing specifically on
familiarizing the user with the various clues.

The surveys gauged mood and self-efficacy, the latter measured using a Self-Efficacy
Questionnaire (SEQ). The surveys were administered before the tutorial (time 1), halfway
through the tutorial (time 2), and after the logic puzzle had been completed (time 3). The data
generated by the sessions were then extracted into a Microsoft Excel spreadsheet using the ERT.
Each user’s data were organized into rows, and the measured variables (mood/SEQ at times
1, 2, 3, and test score) were separated. The appropriate edits, such as summing the SEQ scores
and grading the tests, were then made in Excel. The means and standard deviations of these
values were then computed. Next, the mood and SEQ scores were tested for statistical
significance using paired-samples t-tests. We ran these tests in both Excel and IBM’s Statistical
Package for the Social Sciences (SPSS) software. The spreadsheet output was then transferred to
a .pdf file and the results were written up and analyzed.
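
The analysis steps described above (summing SEQ items, computing means and standard deviations,
and running a paired-samples t-test) can be sketched in a few lines of Python. This is only an
illustration of the calculations, not the Excel or SPSS procedure the evaluators actually used.
The function names are hypothetical, and the scores are made-up values, not the study's raw
data, which this report does not list.

```python
import math
import statistics


def sum_seq(item_ratings):
    """Total SEQ score for one survey administration (sum of item ratings)."""
    return sum(item_ratings)


def describe(values):
    """Return (mean, sample standard deviation) for a list of scores."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t(before, after):
    """Paired-samples t statistic and degrees of freedom.

    t = mean(d) / (sd(d) / sqrt(n)), where d = after - before per participant.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Illustrative scores only: NOT the study's raw data.
seq_time1 = [30, 34, 35]
seq_time3 = [41, 48, 45]

mean1, sd1 = describe(seq_time1)
mean3, sd3 = describe(seq_time3)
t, df = paired_t(seq_time1, seq_time3)
print(f"time 1: M = {mean1:.1f}, SD = {sd1:.2f}")
print(f"time 3: M = {mean3:.1f}, SD = {sd3:.2f}")
print(f"t({df}) = {t:.2f}")
```

Because the test is paired, it works on each participant's difference score, so the order of
participants must be the same in both lists.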

2.3.4 Results

The statistical tests for mood supported the null hypothesis, which predicted that the values for
the measured quantity would not change significantly during the logic puzzle course. According
to a paired-samples t-test, there was no significant difference found between Mood 1 (M = 35.3,
SD = 26.7) and Mood 2 (M = 32.7, SD = 23.2), t(2) = 0.01, p = 0.845. According to a paired-samples
t-test, there was no significant difference found between Mood 2 (M = 32.7, SD = 23.2)
and Mood 3 (M = 23.0, SD = 24.6), t(2) = 0.01, p = 0.531. Finally, according to a paired-samples
t-test, there was no significant difference found between Mood 3 (M = 23.0, SD = 24.6) and
Mood 1 (M = 35.3, SD = 26.7), t(2) = 0.01, p = 0.651.

The SEQ showed similar results to mood except in one case. It was hypothesized that
self-efficacy would increase significantly over time during the logic puzzle course. As participants
continued through the course, they should have grown more confident in their abilities. However,
some of the data did not reflect the hypothesis. According to a paired-samples t-test, there was no
significant difference found between SEQ 1 (M = 33.0, SD = 1.0) and SEQ 2 (M = 42.0,
SD = 9.54), t(2) = 0.01, p = 0.244. According to a paired-samples t-test, there was no significant
difference found between SEQ 2 (M = 42.0, SD = 9.54) and SEQ 3 (M = 46.7, SD = 2.52),
t(2) = 0.01, p = 0.565. According to a paired-samples t-test, there was actually a significant
difference in the expected direction found between SEQ 3 (M = 46.7, SD = 2.52) and SEQ 1
(M = 33.0, SD = 1.0), t(2) = 0.01, p = 0.009. In all cases but one, the general trend was that SEQ
increased over time.
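
With 3 participants, each paired test has 2 degrees of freedom, and for that special case the
two-tailed p-value has a simple closed form. The sketch below is a hypothetical helper (it is not
part of GIFT, Excel, or SPSS) that converts between a t statistic and its two-tailed p-value at
df = 2, which may be useful for cross-checking a reported pair of values.

```python
import math


def two_tailed_p_df2(t):
    """Two-tailed p-value for a t statistic with exactly 2 degrees of freedom.

    Uses the closed-form CDF of Student's t for df = 2:
        F(t) = 1/2 + t / (2 * sqrt(t**2 + 2))
    so p = 2 * (1 - F(|t|)) = 1 - |t| / sqrt(t**2 + 2).
    """
    t = abs(t)
    return 1.0 - t / math.sqrt(t * t + 2.0)


def abs_t_from_p_df2(p):
    """Invert the relation above: the |t| that yields two-tailed p at df = 2."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    q = 1.0 - p
    return q * math.sqrt(2.0 / (1.0 - q * q))


# 4.303 is the conventional two-tailed critical value for alpha = 0.05 at df = 2.
print(f"p for t = 4.303: {two_tailed_p_df2(4.303):.3f}")
print(f"|t| for p = 0.009: {abs_t_from_p_df2(0.009):.2f}")
```

Under this relation, a two-tailed p of 0.009 corresponds to a |t| of roughly 10.5, so a helper
like this is a quick way to check that a statistic and its p-value agree before reporting them.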




3.  Conclusions 


Both  variables  behaved  in  interesting  ways.  Mood  behaved  exactly  as  expected,  in  that  it 
supported  the  null  hypothesis  that  there  were  no  significant  differences.  SEQ,  however,  behaved 
unexpectedly.  It  supported  the  null  hypothesis  between  times  1  and  2  and  times  2  and  3  but 
rejected  the  null  hypothesis  and  showed  significance  between  times  1  and  3.  After  statistical 
analysis  it  was  confirmed  that  mood  ratings  were  inconsistent.  This  makes  sense  because  mood 
is  a  volatile,  nebulous,  and  subjective  quality  that  is  hard  to  quantify.  There  was  also  no  specific 
mood  the  test  looked  to  invoke.  Some  people  may  have  been  happy  at  the  beginning  of  the  test 
while  others  may  have  been  upset.  The  same  could  be  said  of  how  they  finished,  with  some 
finishing  happier  or  more  upset  than  others.  One  would  expect  SEQ  values  to  have  had  statistical 
significance  because  it  makes  sense  that  self-efficacy  increases  as  time  increases.  However,  that 
was not the case between SEQ 1 and 2 and between SEQ 2 and 3. This is because one user’s
SEQ score actually decreased between the second and third times. The test between SEQ 1
and 3 showed the expected rejection of the null hypothesis and confirmed that the difference was
significant; self-efficacy increased with experience. If the sample were larger, the results for
mood would likely still show no significant differences. The results for SEQ would more than
likely have shown more rejections of the null hypothesis. The content questions test had a
perfect score of 20 points, and overall the participants did well. There was a mean score of
17.0 (85%) with a standard deviation of 1.0. The tutorial did appear to be both relevant and
helpful, and there was only a single outlier, which was user 1’s SEQ for time 3.

Many interesting observations were made pertaining to GIFT, a program our group used for the
first time. On the whole, the actual test and survey taking, once in the GIFT program, was
relatively easy and intuitive. However, getting to the tests or even opening the correct files
proved difficult and cumbersome. There were a few organizational issues regarding the interface
and design in general. Besides being generally clunky, the program had no icon for launching it;
instead, the user had to work through a series of folders and file paths to reach the launch
screen. Assuming the polished GIFT product does not include sifting through umpteen levels of
confusing files, the only real prerequisite skill needed for GIFT is a general familiarity with
basic computer controls. In other words, inexperienced users would need nothing more than a
help guide to navigate and understand the complexities of the system itself.

It was difficult and confusing to get the correct data from the ERT, and the data were presented
in a bizarre fashion. To begin with, the ERT was not intuitive to set up and extract data from. To
get the data we needed, we had to select options that were not selected by default and deselect
options that were. While this process may not have been difficult for someone who knows the
system, it will certainly be an issue for new users. Recognizing that some of these options were
most likely for more advanced functions and operation, the ERT could be improved by inserting
buttons that allow simple functions such as t-tests to be completed with a single click.

Additionally, the output produced in Microsoft Excel was extremely problematic and produced
duplicates of every entry, whether question or numeric answer. This output was also set up with 3
columns and a large number of rows. This made scrolling through the data difficult. Some
improvements might be to make it easier to select the data you desire, remove the duplicates, and
show the output in a system with more rows than columns. It would also help if Microsoft Excel
or IBM SPSS functions could be directly used by the program, much like Microsoft PowerPoint
is used during the Logic Tutorial. Thus instead of computing numbers in those programs, the
information needed, such as mean, standard deviation, and p-value, would be part of the ERT
output.
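
As a rough illustration of the cleanup described above, the following Python sketch removes
duplicate entries from a long, 3-column export and regroups it into one record per participant
with one field per measure. The row layout, field names, and values are assumptions made for
this example; the actual ERT export format is not specified in this report and may differ.

```python
def pivot_export(rows):
    """Collapse a long (user, item, value) export into one dict per user.

    Exact duplicate rows are dropped; the first value seen for a
    (user, item) pair is kept.
    """
    wide = {}
    for user, item, value in rows:
        wide.setdefault(user, {}).setdefault(item, value)
    return wide


# Hypothetical export rows; the real ERT layout may differ.
rows = [
    ("user1", "mood_t1", 40),
    ("user1", "mood_t1", 40),  # duplicate entry, as described in the text
    ("user1", "seq_t1", 33),
    ("user2", "mood_t1", 25),
    ("user2", "seq_t1", 32),
    ("user2", "seq_t1", 32),  # duplicate entry
]

table = pivot_export(rows)
for user, measures in table.items():
    print(user, measures)
```

With the data in this shape, each measure can be read off per participant and passed directly
to summary or t-test calculations instead of being picked out of a long list by hand.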

From our assigned perspective, we would change the front end of GIFT to increase both usability
and aesthetics. One example would be making use of the whole screen while giving surveys.
There may have actually been a threat to internal validity from presenting a frustrating screen
in which the survey portion was allotted too little space. The frustration derived
from having to scroll to the right and left to read a question when the entire screen was virtually
unused could have easily created skewed scores, especially the mood ratings. Another would be
to take the logic puzzle out of Microsoft PowerPoint and put it on the HTML (hypertext markup
language) display with the surveys, using radio buttons to mark choices instead of an “x” or “o”.

Another suggestion would be to change the data output to column format, which would be easier
to interpret. The most important change would be making GIFT more accessible. In summary,
the user should be able to double-click an icon and start using the program. They should not have
to open multiple files or wait for green lights. To make GIFT easier to use without a human tutor
there needs to be some audio interaction. It does not have to be extensive, but small, simple
audio cues would be more effective than the repetitive text boxes currently used. They would
break up the sensory load and encourage users to be more attentive by introducing a different
stimulus.

The way ahead for a program like GIFT lies in increased and simplified user interaction. This
way ahead could be in the form of an avatar that takes the place of the human in the tutoring
process. Much like Apple’s “Siri” or the Microsoft Word “Paperclip”, this avatar could be used
as a mascot for the program and a vehicle to present important and helpful messages to users in
a way they are more familiar with. Along with being a guide through tutorials, it could also
double as a device to answer frequently asked questions. The avatar entity could increase both
the learning and retention effect of the program. Users and participants would have an icon to
help them remember and identify the program. Through proper implementation, users may build a
relationship with this entity within the program and become increasingly comfortable relying on
the avatar as they progress through various scenarios. It is important to remember that the
absence of a human tutor requires simplicity for the average user to get through the program
efficiently.

A final take-away is that human-less tutoring does not necessarily mean the program must be
void of human-like interaction. In the short term, this could look as simple as a small object that
has been personified to guide the users. However, in the future this could result in the integration
of GIFT and full-scale human modeling computer programs. The visual and auditory senses
could be engaged while the user also received haptic feedback for physical tasks, thus making
GIFT the ultimate human-less tutor. GIFT is the beginning of a long line of human-less tutors
that could potentially change the way we train, the way we learn, and even the way we live.